
Fix schema option not working #946


Merged · 14 commits · Mar 21, 2025
Conversation

@payala payala commented Mar 13, 2025

Adding a pydantic schema to SmartScraperGraph was not working because the format instructions were being appended to the prompt text, which broke the parsing of the prompt template variables.

The appended "IMPORTANT: " text is removed: the format_instructions are already added to the prompt as template variables, and the appended copy is what breaks the prompt when a schema is passed.

This is my first contribution to this project. I tried to follow all the guidelines; please let me know if there is anything I should do differently.
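The failure mode can be sketched with plain Python string formatting (the template and schema text below are hypothetical, not the project's actual prompts): format instructions contain a literal JSON example, so appending them to the template introduces `{` `}` braces that the template parser then tries to resolve as variables.

```python
# Hypothetical sketch of the bug: format instructions contain literal JSON
# braces, so appending them to the template text breaks variable parsing.
template = "Answer the user question.\n{format_instructions}\nQuestion: {question}"
format_instructions = 'Return JSON matching: {"title": "...", "price": 0.0}'

# Passing the instructions as a template *variable* is safe: substituted
# values are inserted literally and never re-parsed.
ok = template.format(format_instructions=format_instructions,
                     question="What is the price?")

# Appending them to the template itself is not: the parser now treats the
# JSON braces as template fields and fails.
broken = template + "\nIMPORTANT: " + format_instructions
try:
    broken.format(format_instructions="", question="What is the price?")
except (KeyError, ValueError) as exc:
    print("template parsing broke:", repr(exc))
```

LangChain prompt templates use the same brace syntax, which is why injecting format_instructions as a template variable (as this PR keeps) works, while appending the raw text raises a parsing error.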

VinciGit00 and others added 14 commits March 9, 2025 15:09
## [1.41.0](ScrapeGraphAI/Scrapegraph-ai@v1.40.1...v1.41.0) (2025-03-09)

### Features

* add CLoD integration ([4e0e785](ScrapeGraphAI@4e0e785))

### Test

* Add coverage improvement test for tests/test_generate_answer_node.py ([6769c0d](ScrapeGraphAI@6769c0d))
* Add coverage improvement test for tests/test_models_tokens.py ([b21e781](ScrapeGraphAI@b21e781))
* Update coverage improvement test for tests/graphs/abstract_graph_test.py ([f296ac4](ScrapeGraphAI@f296ac4))

### CI

* **release:** 1.41.0-beta.1 [skip ci] ([7bfe494](ScrapeGraphAI@7bfe494))
## [1.42.1](ScrapeGraphAI/Scrapegraph-ai@v1.42.0...v1.42.1) (2025-03-12)

### Bug Fixes

* add new gpt model ([cff799b](ScrapeGraphAI@cff799b))
## [1.43.0](ScrapeGraphAI/Scrapegraph-ai@v1.42.1...v1.43.0) (2025-03-13)

### Features

* add integration for o3min ([fc0a148](ScrapeGraphAI@fc0a148))
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. bug Something isn't working tests Improvements or additions to test labels Mar 13, 2025
Contributor

codebeaver-ai bot commented Mar 13, 2025

I opened a Pull Request with the following:

🔄 4 test files added and 7 test files updated to reflect recent changes.
🐛 Found 1 bug
🛠️ 94/133 tests passed

🔄 Test Updates

I've added or updated 8 tests. They all pass ☑️
Updated Tests:

  • tests/nodes/fetch_node_test.py 🩹

    Fixed: tests/nodes/fetch_node_test.py::test_fetch_json

  • tests/nodes/fetch_node_test.py 🩹

    Fixed: tests/nodes/fetch_node_test.py::test_fetch_xml

  • tests/nodes/fetch_node_test.py 🩹

    Fixed: tests/nodes/fetch_node_test.py::test_fetch_csv

  • tests/nodes/fetch_node_test.py 🩹

    Fixed: tests/nodes/fetch_node_test.py::test_fetch_txt

  • tests/graphs/abstract_graph_test.py 🩹

    Fixed: tests/graphs/abstract_graph_test.py::TestAbstractGraph::test_create_llm[llm_config5-ChatBedrock]

  • tests/graphs/abstract_graph_test.py 🩹

    Fixed: tests/graphs/abstract_graph_test.py::TestAbstractGraph::test_create_llm_with_rate_limit[llm_config5-ChatBedrock]

  • tests/utils/test_proxy_rotation.py 🩹

    Fixed: tests/utils/test_proxy_rotation.py::test_parse_or_search_proxy_success

New Tests:

  • tests/test_generate_answer_node.py

🐛 Bug Detection

Potential issues:

  • scrapegraphai/utils/research_web.py
    The error is occurring in the test_google_search function. The test is expecting exactly 2 results from the search_on_web function, but it's receiving 4 results instead. This mismatch is causing the assertion to fail.
    Let's break down the problem:
  1. The test is calling search_on_web("test query", search_engine="duckduckgo", max_results=2).
  2. The function is expected to return 2 results (as specified by max_results=2).
  3. However, the function is actually returning 4 results.
    This suggests that the search_on_web function is not correctly limiting the number of results to the specified max_results parameter when using the DuckDuckGo search engine.
    The issue is likely in the implementation of the DuckDuckGo search in the search_on_web function. Specifically, in this part of the code:
if search_engine == "duckduckgo":
    research = DuckDuckGoSearchResults(max_results=max_results)
    res = research.run(query)
    results = re.findall(r"https?://[^\s,\]]+", res)

The DuckDuckGoSearchResults object is created with the correct max_results, but the results are then extracted using a regex pattern. This regex extraction might not be respecting the max_results limit.
To fix this, the code should explicitly limit the number of results after the regex extraction:

results = re.findall(r"https?://[^\s,\]]+", res)[:max_results]

This change would ensure that no more than max_results URLs are returned, regardless of how many are found by the regex.
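The suggested fix can be sketched as a self-contained helper (the function name and sample text are hypothetical): slicing the regex matches enforces the `max_results` cap no matter how many URLs the raw engine output contains.

```python
import re

# Hypothetical helper illustrating the suggested fix: cap the number of
# extracted URLs after the regex pass, since the raw engine output may
# contain more links than max_results.
def extract_urls(raw: str, max_results: int) -> list[str]:
    return re.findall(r"https?://[^\s,\]]+", raw)[:max_results]

raw = ("snippet, https://a.example/1, text https://b.example/2] "
       "https://c.example/3")
print(extract_urls(raw, 2))  # → ['https://a.example/1', 'https://b.example/2']
```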

Test Error Log
tests/utils/research_web_test.py::test_google_search: def test_google_search():
        """Tests search_on_web with Google search engine."""
>       results = search_on_web("test query", search_engine="Google", max_results=2)
tests/utils/research_web_test.py:10: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
query = 'test query', search_engine = 'google', max_results = 2, port = 8080
timeout = 10, proxy = None, serper_api_key = None, region = None
language = 'en'
    def search_on_web(
        query: str,
        search_engine: str = "duckduckgo",
        max_results: int = 10,
        port: int = 8080,
        timeout: int = 10,
        proxy: str | dict = None,
        serper_api_key: str = None,
        region: str = None,
        language: str = "en",
    ) -> List[str]:
        """Search web function with improved error handling and validation
    
        Args:
            query (str): Search query
            search_engine (str): Search engine to use
            max_results (int): Maximum number of results to return
            port (int): Port for SearXNG
            timeout (int): Request timeout in seconds
            proxy (str | dict): Proxy configuration
            serper_api_key (str): API key for Serper
            region (str): Country/region code (e.g., 'mx' for Mexico)
            language (str): Language code (e.g., 'es' for Spanish)
        """
    
        # Input validation
        if not query or not isinstance(query, str):
            raise ValueError("Query must be a non-empty string")
    
        search_engine = search_engine.lower()
        valid_engines = {"duckduckgo", "bing", "searxng", "serper"}
        if search_engine not in valid_engines:
>           raise ValueError(f"Search engine must be one of: {', '.join(valid_engines)}")
E           ValueError: Search engine must be one of: searxng, duckduckgo, serper, bing
scrapegraphai/utils/research_web.py:45: ValueError
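Note that the failure in this log is distinct from the max_results issue analyzed above: the test passes search_engine="Google", which is lowercased to "google" and rejected because only duckduckgo, bing, searxng, and serper are accepted. A minimal reproduction of that validation path (hypothetical helper; `sorted()` is added here so the error message is deterministic, unlike the raw set iteration order visible in the log):

```python
# Hypothetical reproduction of the validation path shown in the log:
# "google" is not a supported engine, so the call raises before searching.
def validate_engine(search_engine: str) -> str:
    engine = search_engine.lower()
    valid_engines = {"duckduckgo", "bing", "searxng", "serper"}
    if engine not in valid_engines:
        # sorted() makes the message deterministic; the version in the log
        # joins the raw set, so its ordering can vary between runs.
        raise ValueError(
            f"Search engine must be one of: {', '.join(sorted(valid_engines))}"
        )
    return engine

try:
    validate_engine("Google")
except ValueError as exc:
    print(exc)  # Search engine must be one of: bing, duckduckgo, searxng, serper
```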

☂️ Coverage Improvements

Coverage improvements by file:

  • tests/nodes/fetch_node_test.py

    New coverage: 71.30%
    Improvement: +71.30%

  • tests/graphs/abstract_graph_test.py

    New coverage: 71.88%
    Improvement: +71.88%

  • tests/utils/test_proxy_rotation.py

    New coverage: 0.00%
    Improvement: +0.00%

  • tests/test_generate_answer_node.py

    New coverage: 85.71%
    Improvement: +8.73%

🎨 Final Touches

  • I ran the hooks included in the pre-commit config.


@VinciGit00
Collaborator

Hi @payala,

could you please add a screenshot of results?

@payala
Author

payala commented Mar 21, 2025

You mean this, @VinciGit00?
[screenshot attached]

@VinciGit00
Collaborator

Yes thx

@VinciGit00 VinciGit00 changed the base branch from main to pre/beta March 21, 2025 08:17
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Mar 21, 2025
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Mar 21, 2025
@VinciGit00 VinciGit00 merged commit 16de81f into ScrapeGraphAI:pre/beta Mar 21, 2025
4 checks passed

🎉 This PR is included in version 1.43.1-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀


🎉 This PR is included in version 1.43.1 🎉

The release is available on:

Your semantic-release bot 📦🚀
